Sample of 10 clusters. From each cluster a subsample of N sequences has been recovered, with N:
Then, an another subsample was done taking in account the length of sequences. A sequence is sampled if it has at least len% of the maximum length of the cluster (and so on with a maximum size of 500 sequences), with len%:
from glob import glob
import pandas as pd
import re
from Bio import SeqIO
import Levenshtein as lv
import plotly.express as px
work_dir = "/home/lerat/TE_Aalb/test/cdhit"
sampled_clst_dir = sorted(list(glob(work_dir+"/cluster_*", recursive = True)))
clst_type = ["100", "150", "200", "250", "300", "350", "400", "450", "500", "20%", "40%", "60%", "80%", "100%"]
info = ["nb_seq", "cons_length", "div"]
columns = ["cluster"]
for t in clst_type:
for i in info:
columns.append(i + "-" + t)
refiner = pd.DataFrame(columns=columns)
sizes = [100, 150, 200, 250, 300, 350, 400, 450, 500]
lengths = [2, 4, 6, 8, 10]
for clst in sampled_clst_dir:
row = []
cluster_name = clst.split("/")[-1]
row.append(cluster_name)
cons_500 = clst + "/cluster-500.clst.fasta.refiner_cons"
for seq_record in SeqIO.parse(cons_500, "fasta"):
seq_500 = seq_record.seq
for s in sizes:
sample_name = clst + "/cluster-" + str(s) + ".clst.fasta"
nb_seq = 0
with open(sample_name, "r") as f:
data = f.read()
nb_seq = len(re.findall(r'>', data))
row.append(nb_seq)
cons_file = clst + "/cluster-" + str(s) + ".clst.fasta.refiner_cons"
for seq_record in SeqIO.parse(cons_file, "fasta"):
cons_len = len(seq_record.seq)
row.append(cons_len)
dist = 1 - lv.ratio(str(seq_500), str(seq_record.seq))
row.append(dist)
for l in lengths:
sample_name = clst + "/cluster-" + str(l) + ".clst.fasta"
nb_seq = 0
with open(sample_name, "r") as f:
data = f.read()
nb_seq = len(re.findall(r'>', data))
row.append(nb_seq)
cons_file = clst + "/cluster-" + str(l) + ".clst.fasta.refiner_cons"
for seq_record in SeqIO.parse(cons_file, "fasta"):
cons_len = len(seq_record.seq)
row.append(cons_len)
dist = 1 - lv.ratio(str(seq_500), str(seq_record.seq))
row.append(dist)
refiner.loc[-1] = row
refiner.index = refiner.index + 1
refiner = refiner.sort_index()
refiner = refiner.set_index('cluster')
Each row is a sampled cluster. Each cluster was sampled with 100, 150, 200, 250, 300, 350, 400, 450 and 500 sequences. For each sample we have compute the number of sequences (it can be less then expected if the cluster size is smaller than the sample size), the length of the generated consensus (using Refiner) and the distance of the consensus with the consensus generated using the sample a 500 sequences. Then each cluster was sampled by the sequences length, taking in account sequences having at least the 20%, 40%, 60%, 80% or the 100% of the longest sequence in the cluster, to reach 500 sampled sequences (when possible). As before, we have compute the number of sequences, the consensus length and the distance to the sample with 500 sequences.
refiner
| nb_seq-100 | cons_length-100 | div-100 | nb_seq-150 | cons_length-150 | div-150 | nb_seq-200 | cons_length-200 | div-200 | nb_seq-250 | ... | div-40% | nb_seq-60% | cons_length-60% | div-60% | nb_seq-80% | cons_length-80% | div-80% | nb_seq-100% | cons_length-100% | div-100% | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cluster | |||||||||||||||||||||
| cluster_9 | 100 | 446 | 0.126984 | 150 | 446 | 0.126984 | 200 | 446 | 0.130952 | 250 | ... | 0.791651 | 2 | 4636 | 0.784148 | 2 | 4636 | 0.784148 | 1 | 4805 | 0.791690 |
| cluster_8 | 100 | 202 | 0.212871 | 150 | 202 | 0.000000 | 200 | 202 | 0.000000 | 250 | ... | 0.080569 | 1 | 543 | 0.476510 | 1 | 543 | 0.476510 | 1 | 543 | 0.476510 |
| cluster_7 | 100 | 1403 | 0.076156 | 150 | 1493 | 0.056506 | 200 | 1492 | 0.057494 | 250 | ... | 0.021529 | 1 | 3509 | 0.410131 | 1 | 3509 | 0.410131 | 1 | 3509 | 0.410131 |
| cluster_6 | 100 | 900 | 0.001668 | 119 | 899 | 0.000000 | 119 | 899 | 0.000000 | 119 | ... | 0.007795 | 37 | 897 | 0.007795 | 12 | 897 | 0.008909 | 1 | 899 | 0.414905 |
| cluster_5 | 100 | 147 | 0.000000 | 150 | 147 | 0.000000 | 200 | 147 | 0.000000 | 250 | ... | 0.944064 | 1 | 5109 | 0.944064 | 1 | 5109 | 0.944064 | 1 | 5109 | 0.944064 |
| cluster_4 | 100 | 147 | 0.314149 | 150 | 147 | 0.314149 | 200 | 260 | 0.022642 | 250 | ... | 0.473451 | 1 | 625 | 0.472626 | 1 | 625 | 0.472626 | 1 | 625 | 0.472626 |
| cluster_3 | 100 | 1433 | 0.612737 | 150 | 1869 | 0.532497 | 200 | 1862 | 0.533615 | 250 | ... | 0.060198 | 1 | 5947 | 0.066420 | 1 | 5947 | 0.066420 | 1 | 5947 | 0.066420 |
| cluster_2 | 100 | 1562 | 0.001601 | 127 | 1561 | 0.000000 | 127 | 1561 | 0.000000 | 127 | ... | 0.000000 | 16 | 1603 | 0.014539 | 5 | 1610 | 0.016083 | 1 | 1608 | 0.026191 |
| cluster_1 | 100 | 1164 | 0.493554 | 126 | 3180 | 0.000000 | 126 | 3180 | 0.000000 | 126 | ... | 0.021760 | 1 | 3162 | 0.021760 | 1 | 3162 | 0.021760 | 1 | 3162 | 0.021760 |
| cluster_0 | 100 | 4102 | 0.393554 | 150 | 6571 | 0.004099 | 150 | 6571 | 0.004099 | 150 | ... | 0.028089 | 3 | 6534 | 0.028089 | 1 | 6527 | 0.029855 | 1 | 6527 | 0.029855 |
10 rows × 42 columns
sampled_seq = refiner[['nb_seq-100', 'nb_seq-150', 'nb_seq-200', 'nb_seq-250', 'nb_seq-300', 'nb_seq-350', \
'nb_seq-400', 'nb_seq-450', 'nb_seq-500', 'nb_seq-20%', 'nb_seq-40%', 'nb_seq-60%', \
'nb_seq-80%', 'nb_seq-100%']]
sampled_seq
| nb_seq-100 | nb_seq-150 | nb_seq-200 | nb_seq-250 | nb_seq-300 | nb_seq-350 | nb_seq-400 | nb_seq-450 | nb_seq-500 | nb_seq-20% | nb_seq-40% | nb_seq-60% | nb_seq-80% | nb_seq-100% | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cluster | ||||||||||||||
| cluster_9 | 100 | 150 | 200 | 250 | 300 | 350 | 400 | 450 | 500 | 4 | 4 | 2 | 2 | 1 |
| cluster_8 | 100 | 150 | 200 | 250 | 256 | 256 | 256 | 256 | 256 | 236 | 3 | 1 | 1 | 1 |
| cluster_7 | 100 | 150 | 200 | 250 | 300 | 350 | 400 | 450 | 500 | 593 | 6 | 1 | 1 | 1 |
| cluster_6 | 100 | 119 | 119 | 119 | 119 | 119 | 119 | 119 | 119 | 45 | 40 | 37 | 12 | 1 |
| cluster_5 | 100 | 150 | 200 | 250 | 300 | 350 | 400 | 450 | 500 | 1 | 1 | 1 | 1 | 1 |
| cluster_4 | 100 | 150 | 200 | 250 | 300 | 350 | 372 | 372 | 372 | 128 | 38 | 1 | 1 | 1 |
| cluster_3 | 100 | 150 | 200 | 250 | 300 | 350 | 400 | 450 | 451 | 9 | 4 | 1 | 1 | 1 |
| cluster_2 | 100 | 127 | 127 | 127 | 127 | 127 | 127 | 127 | 127 | 123 | 109 | 16 | 5 | 1 |
| cluster_1 | 100 | 126 | 126 | 126 | 126 | 126 | 126 | 126 | 126 | 15 | 1 | 1 | 1 | 1 |
| cluster_0 | 100 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 37 | 3 | 3 | 1 | 1 |
cons_length = refiner[['cons_length-100', 'cons_length-150', 'cons_length-200', 'cons_length-250', \
'cons_length-300', 'cons_length-350', 'cons_length-400', 'cons_length-450', \
'cons_length-500', 'cons_length-20%', 'cons_length-40%', 'cons_length-60%', \
'cons_length-80%', 'cons_length-100%']]
cons_length = cons_length.T
cons_length
| cluster | cluster_9 | cluster_8 | cluster_7 | cluster_6 | cluster_5 | cluster_4 | cluster_3 | cluster_2 | cluster_1 | cluster_0 |
|---|---|---|---|---|---|---|---|---|---|---|
| cons_length-100 | 446 | 202 | 1403 | 900 | 147 | 147 | 1433 | 1562 | 1164 | 4102 |
| cons_length-150 | 446 | 202 | 1493 | 899 | 147 | 147 | 1869 | 1561 | 3180 | 6571 |
| cons_length-200 | 446 | 202 | 1492 | 899 | 147 | 260 | 1862 | 1561 | 3180 | 6571 |
| cons_length-250 | 444 | 202 | 1495 | 899 | 147 | 270 | 2678 | 1561 | 3180 | 6571 |
| cons_length-300 | 562 | 202 | 1494 | 899 | 147 | 270 | 2810 | 1561 | 3180 | 6570 |
| cons_length-350 | 562 | 202 | 1343 | 899 | 147 | 270 | 2810 | 1561 | 3180 | 6604 |
| cons_length-400 | 562 | 202 | 1343 | 899 | 147 | 270 | 2809 | 1561 | 3180 | 6572 |
| cons_length-450 | 562 | 202 | 1343 | 899 | 147 | 270 | 3244 | 1561 | 3180 | 6571 |
| cons_length-500 | 562 | 202 | 1604 | 899 | 147 | 270 | 5947 | 1561 | 3180 | 6603 |
| cons_length-20% | 4804 | 202 | 1601 | 897 | 5109 | 147 | 5948 | 1561 | 3169 | 6704 |
| cons_length-40% | 4804 | 220 | 1601 | 897 | 5109 | 634 | 5947 | 1561 | 3162 | 6534 |
| cons_length-60% | 4636 | 543 | 3509 | 897 | 5109 | 625 | 5947 | 1603 | 3162 | 6534 |
| cons_length-80% | 4636 | 543 | 3509 | 897 | 5109 | 625 | 5947 | 1610 | 3162 | 6527 |
| cons_length-100% | 4805 | 543 | 3509 | 899 | 5109 | 625 | 5947 | 1608 | 3162 | 6527 |
fig = px.line(cons_length)
fig.show()
cons_div = refiner[['div-100', 'div-150', 'div-200', 'div-250', 'div-300', 'div-350', 'div-400', 'div-450', \
'div-500', 'div-20%', 'div-40%', 'div-60%', 'div-80%', 'div-100%']]
cons_div = cons_div.T
cons_div
| cluster | cluster_9 | cluster_8 | cluster_7 | cluster_6 | cluster_5 | cluster_4 | cluster_3 | cluster_2 | cluster_1 | cluster_0 |
|---|---|---|---|---|---|---|---|---|---|---|
| div-100 | 0.126984 | 0.212871 | 0.076156 | 0.001668 | 0.000000 | 0.314149 | 0.612737 | 0.001601 | 0.493554 | 0.393554 |
| div-150 | 0.126984 | 0.000000 | 0.056506 | 0.000000 | 0.000000 | 0.314149 | 0.532497 | 0.000000 | 0.000000 | 0.004099 |
| div-200 | 0.130952 | 0.000000 | 0.057494 | 0.000000 | 0.000000 | 0.022642 | 0.533615 | 0.000000 | 0.000000 | 0.004099 |
| div-250 | 0.379722 | 0.000000 | 0.054534 | 0.000000 | 0.000000 | 0.003704 | 0.444638 | 0.000000 | 0.000000 | 0.003188 |
| div-300 | 0.001779 | 0.000000 | 0.054229 | 0.000000 | 0.000000 | 0.003704 | 0.363252 | 0.000000 | 0.000000 | 0.003264 |
| div-350 | 0.003559 | 0.000000 | 0.098744 | 0.000000 | 0.000000 | 0.000000 | 0.362567 | 0.000000 | 0.000000 | 0.000681 |
| div-400 | 0.005338 | 0.000000 | 0.098744 | 0.000000 | 0.000000 | 0.000000 | 0.361809 | 0.000000 | 0.000000 | 0.004023 |
| div-450 | 0.000000 | 0.000000 | 0.100102 | 0.000000 | 0.000000 | 0.000000 | 0.296268 | 0.000000 | 0.000000 | 0.004099 |
| div-500 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| div-20% | 0.791651 | 0.000000 | 0.003432 | 0.007795 | 0.944064 | 0.309353 | 0.058092 | 0.000000 | 0.018113 | 0.039603 |
| div-40% | 0.791651 | 0.080569 | 0.021529 | 0.007795 | 0.944064 | 0.473451 | 0.060198 | 0.000000 | 0.021760 | 0.028089 |
| div-60% | 0.784148 | 0.476510 | 0.410131 | 0.007795 | 0.944064 | 0.472626 | 0.066420 | 0.014539 | 0.021760 | 0.028089 |
| div-80% | 0.784148 | 0.476510 | 0.410131 | 0.008909 | 0.944064 | 0.472626 | 0.066420 | 0.016083 | 0.021760 | 0.029855 |
| div-100% | 0.791690 | 0.476510 | 0.410131 | 0.414905 | 0.944064 | 0.472626 | 0.066420 | 0.026191 | 0.021760 | 0.029855 |
fig = px.line(cons_div)
fig.show()